Measuring Search Retrieval Accuracy of Uncorrected OCR: Findings from the Harvard-Radcliffe Online Historical Reference Shelf Digitization Project
Not recorded
Abstract
This report presents the findings of an investigation to evaluate the conditions for search retrieval successes and failures when using uncorrected OCR for indexing. The purpose of the study was to assess whether low-cost, high-production techniques for text conversion were adequate to produce digital reproductions of consistent quality and usability. We sought to identify attributes of the original material or the OCR-produced text that could predict when additional, costly processes (OCR correction or keying) would be needed to meet retrieval requirements for text digitization projects.
Similar resources
Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Characte...
Retrieval of Spelling Variants in Nonstandard Texts – Automated Support and Visualization
This article describes ongoing research in the RSNSR (Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung, “Rule-based search in text databases with nonstandard orthography”) project. The focus of this project is making historical text documents digitally available; consequently, it examines the challenges for digitization procedures and subsequent retrieval operati...
A Study of the Evolution of the Field of "Reference Services and Sources" Using Reference Publication Year Spectroscopy
Purpose: To identify major events in the development of the Reference Services literature. Methodology: The Reference Publication Year Spectroscopy (RPYS) technique was used. Initial data were obtained from Scopus using a scientometric method. A comprehensive search strategy led to the retrieval of 5007 records. RPYS software was used to revise the data. Excel was used for visualization of findi...
Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine
Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features were assigned to each group of images separately. In this regard, each group image surr...
Estimating Digitization Costs in Digital Libraries Using DiCoMo
Estimating digitization costs is a very difficult task. Exact predictions are hard to make because of the large number of unknown factors. However, digitization projects need a precise idea of the economic costs and the time involved in developing their contents. The common practice when we start digitizing a new collection is to set a schedule, and a firm commitment ...